Spark: Support aggregate pushdown for identity partition column GROUP BY #16176
hemanthboyina wants to merge 3 commits into apache:main
Conversation
| Map<List<Object>, AggregateEvaluator> evaluatorsByPartition =
|     groupFilesByPartition(spec, groupByPositions, boundAggregates);
I am not confident this is correct. Also, we are only checking the current partitioning; a table can contain files written under many different partition specs that evolved across snapshots.
Thanks for the review @singhpk234. You raised a valid point: the current implementation only considers the current partition spec and bails out for files from other specs. I will look into handling spec evolution properly and update the PR.
Handled the partition spec evolution case. Can you please review?
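To illustrate why spec evolution matters here, the sketch below uses hypothetical stand-in types (not Iceberg's `PartitionSpec` API): pushdown is only safe when every spec that produced a scanned file partitions each GROUP BY column by identity, so a single check against the current spec is not enough.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each spec is modeled as column name -> transform name
// ("identity", "bucket", ...). Real code would inspect PartitionSpec fields.
public class SpecEvolutionCheck {
  static boolean safeToPushDown(
      List<Map<String, String>> specsOfScannedFiles, List<String> groupByColumns) {
    for (Map<String, String> spec : specsOfScannedFiles) {
      for (String col : groupByColumns) {
        if (!"identity".equals(spec.get(col))) {
          // some scanned file was written under a spec that does not
          // identity-partition this GROUP BY column: fall back to a normal scan
          return false;
        }
      }
    }
    return true;
  }
}
```

If any snapshot in the scan contains files from an older spec (say, one that bucketed the column instead), the check fails and the aggregate must be computed from row data.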
anuragmantri
left a comment
Thanks for the useful PR @hemanthboyina. Overall, it looks good to me. I made some suggestions.
| return -1;
| }
|
| private boolean allGroupByAreIdentityPartitionFields(Aggregation aggregation) {
allGroupByAreIdentityPartitionFields() and resolveGroupByFields() look very similar, except:
- allGroupByAreIdentityPartitionFields() additionally checks instanceof NamedReference
- resolveGroupByFields() additionally collects field IDs and fields into output lists

Can we merge these two? Or maybe let canPushDownAggregation() allow GROUP BY and then have the checks in this merged method? What do you think?
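One shape the merged method could take, sketched with local stand-in types rather than Spark's Aggregation/NamedReference: a single pass that both validates the GROUP BY expressions and resolves them, returning empty when validation fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: a String stands in for a resolvable NamedReference,
// and any other expression type stands in for something we cannot push down.
public class GroupByResolver {
  static Optional<List<String>> resolveGroupBy(
      List<Object> groupByExpressions, Set<String> identityPartitionColumns) {
    List<String> resolved = new ArrayList<>();
    for (Object expr : groupByExpressions) {
      if (!(expr instanceof String)) {
        return Optional.empty(); // stands in for the instanceof NamedReference check
      }
      String name = (String) expr;
      if (!identityPartitionColumns.contains(name)) {
        return Optional.empty(); // not an identity partition column
      }
      resolved.add(name); // stands in for collecting field IDs / fields
    }
    return Optional.of(resolved);
  }
}
```

Returning `Optional.empty()` on failure lets a caller like canPushDownAggregation() use one call for both the boolean check and the resolved output.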
| return true;
| }
|
| private static class ArrayStructLike implements StructLike {
Can we use AggregateEvaluator.ArrayStructLike instead? May have to make it package-private.
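For context, a struct-like wrapper of this kind is typically a thin view over an `Object[]` with value-based equality so it can key a map of per-group evaluators. The sketch below uses a local stand-in interface, not Iceberg's `org.apache.iceberg.StructLike` or the actual `AggregateEvaluator.ArrayStructLike`.

```java
import java.util.Arrays;

public class ArrayStructDemo {
  // local stand-in for the StructLike interface
  interface Struct {
    int size();
    <T> T get(int pos, Class<T> javaClass);
  }

  static class ArrayStruct implements Struct {
    private final Object[] values;

    ArrayStruct(Object[] values) {
      this.values = values;
    }

    @Override
    public int size() {
      return values.length;
    }

    @Override
    public <T> T get(int pos, Class<T> javaClass) {
      return javaClass.cast(values[pos]);
    }

    // value-based equality/hashCode so two structs built from equal partition
    // tuples land in the same HashMap bucket of per-group evaluators
    @Override
    public boolean equals(Object other) {
      return other instanceof ArrayStruct && Arrays.equals(values, ((ArrayStruct) other).values);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(values);
    }
  }
}
```

Reusing the existing nested class (made package-private) would avoid maintaining two copies of this equality logic.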
| @@ -568,11 +568,9 @@ public void testAggregationPushdownOnBucketedColumn() {
| sql(
|     "CREATE TABLE %s (id BIGINT, struct_with_int STRUCT<c1:INT>) USING iceberg PARTITIONED BY (bucket(8, id))",
|     tableName);
Nit: Unrelated whitespace change.
| @@ -909,4 +907,183 @@ public void testAggregatePushDownForIncrementalScan() {
| assertEquals(
|     "min/max/count push down", expected2, rowsToJava(unboundedPushdownDs.collectAsList()));
| }
| @TestTemplate
| public void testGroupByIdentityPartitionColumnCountPushDown() {
Can we also verify that the EXPLAIN string shows the pushdown, like the other tests do?
| }
|
| @TestTemplate
| public void testGroupByIdentityPartitionColumnWithMinMax() {
Same here, can we also have explain string verification?
This PR enables aggregate pushdown for queries with GROUP BY on identity partition columns. Currently, Iceberg supports pushing down aggregates (COUNT, MIN, MAX) for queries without GROUP BY, computing results from file metadata instead of reading data files. However, when a query includes GROUP BY, the pushdown is disabled even when the GROUP BY columns are identity partition fields.
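The core idea can be shown with a small stand-in example (hypothetical per-file stats, not Iceberg's manifest API): when the GROUP BY column is an identity partition column, COUNT(*) per group is just the sum of each data file's record count, keyed by the file's partition value, so no data rows need to be read.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetadataCount {
  // input: partition value -> record counts of the data files in that partition
  // (as would be read from file-level metadata, not from the rows themselves)
  static Map<Object, Long> countByPartition(Map<Object, List<Long>> recordCountsByPartition) {
    Map<Object, Long> counts = new HashMap<>();
    recordCountsByPartition.forEach(
        (partition, recordCounts) ->
            counts.put(partition, recordCounts.stream().mapToLong(Long::longValue).sum()));
    return counts;
  }
}
```

MIN/MAX work analogously from per-file lower/upper bounds, which is why the pushdown in this PR is restricted to identity partition columns: only there does a file's partition value equal the column value for every row in the file.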